---
title: Training on large datasets
description: Describes sampling approaches that can be used when training on larger datasets.
---

# Training on large datasets {: #training-on-large-datasets }

<span style="color:red;font-size: 1rem"> `Robot 1`</span>

Hi Team&mdash;we've got a couple of questions from a customer about our data ingest limits. I'd appreciate it if someone could answer or point me in the right direction.

**What are some good practices for handling AutoML and EDA on larger datasets?**

<span style="color:red;font-size: 1rem"> `Robot 2`</span>

We are still using this presentation based on my R&D work: 

!!! example "Presentation summary"

    **Background**

    The original training data, converted from a SAS model, has 21M records with 6,425 features; the physical data size is about 260GB in CSV format. We want to feed in all 21M records with 960+ features, for an estimated data size of about 40GB in CSV format.

    === "Dataset size"

        100GB: 99GB (training) + 1GB (external holdout)

    === "Environment"

        Self-managed: 96 CPUs, 3TB RAM, 50TB HDD
        
        SaaS: Cloud account with 20 modelling workers

    **Problem statement**

    Do we really need to train the model on a large dataset?

    **Divide and conquer approach**

    === "Option 1: Dataset sampling"

        1. Randomly sample the original dataset down to a sample of N GB.
        2. Run Autopilot.
        3. Deploy the recommended-to-deploy model (trained on 100% of the sampled dataset).
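
        The sampling step above can be sketched with pandas, streaming the CSV in chunks so the full file never has to fit in memory. The file names, chunk size, and sampling fraction below are illustrative assumptions, not part of the original workflow:

        ```python
        import pandas as pd

        def sample_large_csv(src, dst, frac, seed=42):
            """Stream `src` in chunks, writing a random `frac` of its rows to `dst`."""
            first = True
            for chunk in pd.read_csv(src, chunksize=1_000_000):
                # Sample each chunk independently; a fixed seed makes the
                # result reproducible across runs.
                chunk.sample(frac=frac, random_state=seed).to_csv(
                    dst, mode="w" if first else "a", header=first, index=False
                )
                first = False

        # e.g., keep ~10% of the rows:
        # sample_large_csv("full.csv", "sample.csv", frac=0.1)
        ```

        Chunked per-chunk sampling approximates a uniform random sample of the whole file while keeping peak memory bounded by the chunk size.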

    === "Option 2: Feature sampling"

        1. Take an N GB sample, with all features.
        2. Run Autopilot.
        3. Run Feature Impact on the recommended-to-deploy model (trained on 100% of the sampled dataset).
        4. Use Feature Impact to select the features with an impact greater than 1%.
        5. Select those features (>= 1%) from the full dataset; if the result is less than 10GB, model all rows.
        6. If the result is greater than 10GB, randomly sample the dataset from Step 5 down to 10GB.
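
        The feature-selection step above can be sketched as a simple threshold filter, assuming the Feature Impact scores have been exported as a feature-to-impact mapping with values normalized to 1.0. The column names and data below are illustrative:

        ```python
        import pandas as pd

        def select_impactful(df, impact, threshold=0.01, target="target"):
            """Keep the target column plus every feature whose normalized
            impact is at least `threshold` (1% by default)."""
            keep = [f for f, score in impact.items()
                    if score >= threshold and f in df.columns]
            return df[keep + [target]]
        ```

        If the reduced dataset still exceeds the size budget, the same random-sampling step from Option 1 can then be applied to it.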

    **Results**

    - Both divide-and-conquer approaches (dataset sampling and feature sampling) can rival models trained on the full-size dataset.
    - Full-size trained model vs. dataset sampling: +1.5% (worst case) and +15.2% (best case).
    - Full-size trained model vs. feature sampling: -0.7% (worst case) and +8.7% (best case).
    - Feature sampling is suitable for datasets containing hundreds of features (or more) and can yield models that are similar, or even superior, across all metrics (accuracy, training/scoring time) compared to models trained on the full dataset.